How to Approach Disaster Recovery

10/24/2010 4:35:04 PM

Often, disaster recovery specialists refer to a seven-tier disaster recovery paradigm. These tiers start at the bottom of the disaster recovery food chain—Tier 0 (no offsite data—possibly no recovery) and progress up to the highest level of disaster recovery possible—Tier 7 (zero to near-zero data loss with highly automated recovery). We developed a much more simplified and more generalized five-level representation that should aid you in understanding DR and how you can approach it more readily. Let’s start with where we think most small to midsize companies find themselves: they have little to no disaster recovery plan and are operating at a fairly high risk exposure level. When we look at Figure 1 , this places those folks in the very bottom of the upside-down pyramid at Level 1 and perhaps not even in the DR pyramid at all (below the line—no data backup offsite at all—Level 0).
Figure 1. Disaster recovery levels pyramid.

Very likely, many of these companies are running a huge risk that they may not really want to have. The good news is that getting to a more protected level with your SQL Server–based applications isn’t really that hard to do. But nothing is free. After you establish some type of DR plan and create the technical process and mechanisms for your DR, you still have to implement and test it. Now, let’s examine more closely the disaster recovery levels and what level your company is currently or needs to go to.

Level 0 is very understandable because it relates to undergoing a disaster and being able to recover from it. You basically have to pick up the pieces (after a fire or something) and see what is salvageable from your site. Even the best onsite backup plans are rendered meaningless in regard to disaster recovery if the backups were not stored somewhere safe (offsite).

You must get to Level 1 as soon as possible if you are even remotely serious about disaster recovery. We are talking about some very basic capability of creating a recoverable image (database backups, system configuration backups, user ID, permissions, role backups, and so on) that can effectively allow you to rebuild your critical applications and databases that are running on SQL Server at an alternate location. Data loss is probably involved, but it may not be enough to cause your company to completely go out of business. If you are not doing this right now, do it today!

Level 2 adds a much more real-time recovery time frame into the mix and gets to more of a point-in-time recovery capability. Using capabilities such as electronic vaulting aids greatly in your being able to restore systems (databases) rapidly and get back online to the world within a short amount of time. There is still a certain level of data loss here. That cannot be avoided.

Level 3 moves you to the complexities of recovering transactional integrity in your applications and minimizing your data losses. This level requires much more effort, resources, and sophistication, but is very doable.

Finally, many larger companies have Level 4 in place to completely protect themselves from single-site failures or disasters without missing a single order transaction. This chapter identifies the different SQL Server–based options that allow you to achieve from Level 1 through Level 4 DR. In some cases, achieving this recovery involves using a Microsoft product combined with a third-party product.

The mandate we put on you is to devise a highly efficient disaster recovery plan in support of your business continuity needs and then test it completely! You need to make sure it considers all aspects of completely being able to come up onto an alternate location as smoothly, quickly, and with as little data loss as possible. Defining a disaster recovery plan can be a tedious job owing to the potential complexities of your configuration. But, keep in mind, a disaster recovery plan is essential to your company’s sheer existence and should be treated as such. Your objective is to move up the pyramid (to Level 4 or near it) to match your company’s needs for business continuity. We don’t want to read in the trades how your company took weeks to recover from a failed data center disaster and eventually went out of business; industry statistics are against you if you haven’t prepared.

Disaster Recovery Patterns

In general, you should consider three main DR patterns when trying to achieve Level 1 through Level 4 DR.

Active/Passive DR Sites Pattern

Figure 2 illustrates the active/passive DR sites pattern. This typical disaster recovery configuration is probably the most common in the world (for those who are doing DR). It involves a primary site (the normal environments in which you do business day in and day out) and a passive DR site. This passive DR (alternative) site can be anything and anywhere. It could be a “hot” alternate site that is ready for you to apply a current database backup to (and application image, too), or it could be a “cold” alternate site that you have to either go out and lease, co-locate with, or build completely from scratch. The available resources, money, and business need determines which method you choose. Obviously, the colder your DR site, the longer it will take for you to recover using it. A typical estimate of the time needed for a small to midsized company to completely rebuild the essential systems on an alternate site (DR site) is between 23 and 31 days.

Figure 2. Active/passive DR sites pattern.

The Microsoft products to help you achieve this DR pattern are database backups taken offsite and readily recallable to recover (restore) your database tier and, if you have a “hot” DR site available, you can utilize data replication to the DR site or log shipping or even asynchronous database mirroring. Some third-party products, such as Symantec’s Veritas Volume Replicator, push physical byte-level changes to the passive (hot) DR site physical tier level. In most of these options, the DR site is passive, literally sitting there idle until it is needed. The only exception to that rule is when you are using Microsoft data replication or database snapshots with database mirroring. Even in those cases, the DR site does not support transactional change (updates, deletes, inserts); it supports read access only. Also, you need to remember that log shipping is on the way out in future Microsoft releases, so don’t plan too much new usage of that feature.

Active/Active DR Sites Pattern

An active/active DR configuration essentially provides two primary sites that can process transactions equally. Any particular external or internal usage of the applications doesn’t really know which site is processing the requests (and, in fact, doesn’t need to know). This configuration has the entire application stack completely deployed on both sites. The data is kept in sync in real-time, and either site could be lost and not affect the operation of the company. Some data loss is possible, but this is usually kept at a minimum, depending on which Microsoft options you are using. Figure 3 shows an active/active DR configuration that bidirectionally keeps the databases in sync.

Figure 3. Active/active DR sites pattern.

A few things may cause issues for this pattern, such as the need to make sure that no application keeps “state” from one transaction to the other. Additionally, the application and/or the web tier needs to be able to route user connections (the load) to either site in some type of balanced or round-robin method. This is often done with big IP routers that use round-robin routing algorithms, for example, to determine which site to direct connections to. Active/active configurations can be created using peer-to-peer continuous data replication as well as other multi-updating subscriber replication topologies. A slight twist to having two primary sites is to have one primary site and a secondary site that doesn’t process transactions but is actively used for reporting, testing, and other tasks (just no processing that changes anything). In the event of a primary failure, the secondary site can take over full primary site responsibilities quickly. This is sort of active/passive, with active “secondary usage” on the passive site (following the first active/passive DR pattern described previously). This type of configuration can take advantage of database mirroring and database snapshots (for the reporting). There are plenty of advantages to this variation, which greatly distributes the workload and moves up the DR pyramid.

Active Multisite DR Pattern

An active multisite DR configuration contains three or more active sites, with the intention of using any one of them as the DR site for the other (as shown in Figure 4 ). This pattern allows you to distribute your applications redundantly between any pair of sites, but not to all three (or more). For instance, you could have half of Primary Site 1’s applications on Primary Site 2 and the other half on Primary Site 3. This way, you spread out the risk further and increase your odds of uninterrupted processing.

Figure 4. Active multisite DR pattern.

Again, having “stateless” applications is critical here, as is some smart routing of all connections to the right sites. Using continuous data replication and the database mirroring options allows you to easily create such a DR topology. And, again, you also have the secondary usage variation available to you if one or more alternative sites were passive (with secondary usage supporting reporting, for example).

Choosing a Disaster Recovery Pattern

We reduce these to patterns because, at the foundational level, they represent what you need to do to support the level of business continuity your company demands. Some companies can tolerate different levels of loss because of the nature of their business; others cannot. At the highest levels, it is fairly easy to match these patterns to what your business requires. In this chapter, we look at what SQL Server capabilities are available to help you implement these patterns.

Often, global companies devise a DR configuration that reserves each major data center site in their regions as the active or passive DR site for another region. Figure 56.5 shows one large high-tech company’s global data center locations. Its Alexandria, Virginia, site is also the passive DR site for its Phoenix, Arizona, site. Its Paris, France, regional site is also the DR site for its Alexandria, Virginia, site, and so on.

Figure 5. Using active regional sites for passive DR.

For companies that have multiple data center sites but only need to support the active/passive DR pattern, a very popular variation can be used. This variation is called reciprocal DR. As you can see in Figure 6 , there are two sites (Site 1 and Site 2). Each is active for some applications (Applications 1, 3, and 5 on Site 1 and Applications 2, 4, and 6 on Site 2). Site 1’s applications are passively supported on Site 2, and Site 2’s applications are passively supported on Site 1. Rolling out the configuration this way eliminates the “stateless” application issue completely and is fairly easy to implement. It is also possible to provide the passive applications data available via database snapshots at the other reciprocal site (for free!), further leveraging distributing workload geographically.

Figure 6. Reciprocal DR.

This configuration also spreads out the risk of losing all applications if one site ever happens to be lost (as in a disaster). Again, the Microsoft products to help you achieve this DR pattern variation are data replication to the DR site, or log shipping, and even asynchronous database mirroring with database snapshots available to help with some distributed reporting. As we noted previously, third-party products such as Symantec’s Veritas Volume Replicator can be used to push physical byte-level changes to the passive (hot) DR site physical tier level.

Recovery Objectives

You need to understand two main recovery objectives: the point in time to which data must be restored to be able to successfully resume processing (called the recovery point objective) and the acceptable amount of downtime that is tolerable (called the recovery time objective). The recovery point objective (RPO) is often thought of as the time between the last backup and the point when the outage occurred. It indicates the amount of data that will be lost. The recovery time objective (RTO) is determined based on the acceptable downtime in case of a disruption of operations. It indicates the latest point in time at which the business operations must resume after disaster (that is, how much time can elapse).

The RPO and RTO form the basis on which a data protection strategy is developed. This helps to provide a picture of the total time that a business may lose due to a disaster. The two of them together are very important requirements when designing a solution. Let’s put these terms in the form of algorithms:

RTO = Difference between the time of the disaster to the time the system is operational – Time operational (up) – Time disaster occurred (down)
RPO = Time since the last backup of complete transactions representing data that must be re-acquired or entered – Time disaster occurred – Time of last usable data backup

Therefore:

Total lost business time = Time operational (up) – Time disaster occurred (down) – Time of the last usable data backup

Knowing your RPO and RTO requirements is essential in determining what DR pattern to use and what Microsoft options to utilize.

A Data-Centric Approach to Disaster Recovery

Disaster recovery is a complex undertaking unto itself. However, it isn’t really necessary to recover every system or application in the event of a disaster. Priorities must be set on determining exactly which systems or applications must be recovered. These are typically the revenue-generating applications (such as order entry, order fulfillment, and invoicing) that your business relies on to do basic business with its customers. Therefore, you set the highest priorities for DR with those revenue-generating systems. Then the next level of recovery is for the second-priority applications (such as HR systems).

After you prioritize which applications should be part of your DR plans, you need to fully understand what must be included in recovery to ensure that these priority applications are fully functional. The best way is to take a data-centric approach, which focuses on what data is needed to bring up the application. Data comes in many flavors, as Figure 7 shows:

Metadata— The data that describes structures, files, XSDs, and so on that the applications, middleware, or back end needs.
Configuration data— The data that the application needs to define what it must do, or the middleware needs to execute with, and so on.
Application data values— The data itself within your database files that represents the transactional data in your systems.

Figure7. Types of data and where the data resides.

As just mentioned, you first identify which applications you must include in your DR plans, and then you must make sure you back up and are able to recover that application’s data (metadata, configuration data, and application data). As part of this exercise, you must determine how tightly or loosely coupled the data is to other applications. In other words (as you can also see in Figure 7 ), if on the back-end tier, Database A has the orders transactions and Database B has the invoicing data, both must be included in the DR plans (because they are tightly coupled). In addition, you must also know how tightly or loosely coupled the application stack components are with each layer. In other words (again looking at Figure 7 ), if the ERP application (in the application tier) requires some type of middleware to be present to handle all its messaging, that middleware tier component is tightly coupled with the ERP application and so on.